## -*- mode: html; coding: utf-8; -*-

## This file is part of CDS Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN.
##
## CDS Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

<!-- WebDoc-Page-Title: The HEP taxonomy: rationale and extensions -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail"
    href="<CFG_SITE_URL>/help/hacking">Hacking CDS Invenio</a> &gt;
    <a class="navtrail" href="bibclassify-internals"/>BibClassify Internals</a> -->
<!-- WebDoc-Page-Revision: $Id$ -->

<p>The DESY Library has been responsible for maintaining a thesaurus of high
energy physics (HEP) terms for a long time. The thesaurus is currently used as
a subject controlled vocabulary by HEP institutes worldwide. The need to
convert the HEP
<a href="http://www-library.desy.de/schlagw2.html">text thesaurus</a> to a
more complex taxonomy (in
<code>/opt/cds-invenio/etc/bibclassify/HEP.rdf</code>), with richer structure
and semantics, was mainly driven by the needs of the BibClassify system. The
current taxonomy of high energy physics is expressed in the
<a href="http://www.w3.org/2004/02/skos/">SKOS</a> syntax. SKOS is a dialect
of RDF and it is especially intended for the representation of knowledge
organization systems, such as thesauri, taxonomies and basic ontologies.</p>

<p><span class="adminbox">&nbsp;<b>NB.</b> The reasons behind the adoption of
SKOS, instead of other similar knowledge organization formats - notably
<a href="http://www.w3.org/TR/owl-features/">OWL</a> - are to do with the
simplicity, yet completeness, of SKOS. The SKOS language contains all the
basic properties that were needed to express the taxonomy and it allows
straightforward conversion from text to RDF. If you are interested in a more
detailed discussion on this matter, please check the
<a href="https://savannah.cern.ch/cookbook/?func=detailitem&item_id=112">CERN-DESY email correspondence</a>.
</span></p>

<p>In order to satisfy the needs of typical HEP classification schemes and
practices, the SKOS language had to be extended to include additional
properties. HEP keywords are often expressed as a combination - a pair - of
keywords. An example of this is:
<blockquote>
<code>Born-Infeld model: monopole</code>
</blockquote>
Here both <code>Born-Infeld model</code> and <code>monopole</code> are
standard HEP keywords (<em>single keywords</em>). However, they also combine
together to express a collective concept. We call it a
<em>composite keyword</em> and have extended the SKOS language in order to
include such new paradigm. The two property extensions created for this
purpose are:</p>
<ul>
<li><code>composite</code>: to express a relationship of <em>combination</em>
with another single keyword by pointing to the target composite keyword. It is
a subProperty of the SKOS property <code>narrower</code></li>
<li><code>compositeOf</code>: to express the relationship of dependence of a
composite keyword from two single concepts. It is a subProperty of the SKOS
property <code>broader</code></li>
</ul>

<p> These two extensions are probably best described by an example. The single
keyword <code>Born-Infeld model</code> is expressed in the HEP taxonomy as:
<blockquote><pre>
&lt;Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"&gt;
  &lt;prefLabel xml:lang="en"&gt;Born-Infeld model&lt;/prefLabel&gt;
  &lt;hiddenLabel xml:lang="en"&gt;Born-Infeld&lt;/hiddenLabel&gt;
  &lt;altLabel xml:lang="en"&gt;DBI&lt;/altLabel&gt;
  &lt;broader rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheoreticalmodel"/&gt;
  &lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelrelativistic"/&gt;
  &lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonlinear"/&gt;
  &lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonabelian"/&gt;
  &lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/&gt;
  &lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelchiral"/&gt;
&lt;/Concept&gt;
</pre></blockquote>
The concept contains all the usual SKOS tags to express the relations and
denominations of a concept (<code>prefLabel</code>, <code>broader</code>,
etc.). In addition, it contains five <code>composite</code> tags: these link
to five different combinations of this keyword with fellow single keywords.
For example one of these points to the composite keyword
<code>Born-Infeld model: monopole</code>, whose entry is:
<blockquote><pre>
&lt;Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"&gt;
&lt;prefLabel xml:lang="en"&gt;Born-Infeld model: monopole&lt;/prefLabel&gt;
  &lt;compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/&gt;
  &lt;compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/&gt;
&lt;/Concept&gt;
</pre></blockquote>
</p>

<p>The structure of single and composite keywords, as well as their
associations expressed by properties <code>composite</code> and
<code>compositeOf</code> are self-evident. By using such a model, we are able
to efficiently extract keyword pairs from fulltext, as explained in the
<a href="bibclassify-extraction-algorithm">BibClassify extraction guide</a>.
</p>

<p> Finally, it is worth pointing out a couple of other syntax practices that
might be specific only to the HEP taxonomy:</p>
<ul>
<li><code>hiddenLabel</code>: this lexical label (currently in unstable state)
is used primarily to define alternative displays of a keyword that are meant to
be hidden from the user, but are still accessible to text parsing operations.
In the HEP taxonomy, these include misspellings and wildcards (strictly
expressed as legal regular expressions).</li>
<li> <code>note</code>: this label is currently reserved to describe a
<code>nostandalone</code> condition - a property of those single keywords that
can only appear as part of composite keywords (their occurrence as standalones
is discarded).</li>
</ul>
